States pollution Index in 2017

========================================================

by Saravanan Natarajan

========================================================

Abstract

The annual summary file from the Environmental Protection Agency’s Air Quality System for 2017 contains several data sets related to pollution. Ground-level ozone and airborne particles are the two pollutants that pose the greatest threat to human health. This project explores the good days and the days unhealthy for sensitive groups in 2017 across the states of the USA.

Dataset

Get the data from the CSV file and explore

## 'data.frame':    1044 obs. of  19 variables:
##  $ State                              : Factor w/ 54 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ County                             : Factor w/ 809 levels "Abbeville","Ada",..: 43 166 178 210 249 253 340 361 408 445 ...
##  $ Year                               : int  2017 2017 2017 2017 2017 2017 2017 2017 2017 2017 ...
##  $ Days.with.AQI                      : int  135 58 140 241 105 178 140 243 11 167 ...
##  $ Good.Days                          : int  111 49 128 219 100 118 125 103 11 143 ...
##  $ Moderate.Days                      : int  23 9 12 22 5 58 14 134 0 24 ...
##  $ Unhealthy.for.Sensitive.Groups.Days: int  1 0 0 0 0 1 1 6 0 0 ...
##  $ Unhealthy.Days                     : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ Very.Unhealthy.Days                : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hazardous.Days                     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Max.AQI                            : int  108 66 63 80 58 163 131 129 38 100 ...
##  $ X90th.Percentile.AQI               : int  61 54 50 50 48 62 51 76 30 58 ...
##  $ Median.AQI                         : int  41 27 41 40 38 45 38 53 18 42 ...
##  $ Days.CO                            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Days.NO2                           : int  0 0 0 0 0 0 0 2 0 0 ...
##  $ Days.Ozone                         : int  104 0 110 224 105 73 106 73 0 118 ...
##  $ Days.SO2                           : int  0 0 0 0 0 0 0 45 0 0 ...
##  $ Days.PM2.5                         : int  31 58 30 17 0 105 34 123 11 22 ...
##  $ Days.PM10                          : int  0 0 0 0 0 0 0 0 0 27 ...
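As a minimal sketch of the loading step, the snippet below writes a three-row stand-in file and reads it back. The file name is hypothetical, and the header spellings with spaces are an assumption about the raw EPA export; the row values are the first three rows of the structure listing above.

```r
# Hypothetical file name; a tiny stand-in for the real 1044-row EPA export.
csv_path <- file.path(tempdir(), "annual_aqi_by_county_2017.csv")
writeLines(c(
  "State,County,Year,Days with AQI,Good Days,Moderate Days",
  "Alabama,Baldwin,2017,135,111,23",
  "Alabama,Clay,2017,58,49,9",
  "Alabama,Colbert,2017,140,128,12"
), csv_path)

# check.names = TRUE (the default) rewrites "Days with AQI" as Days.with.AQI.
# stringsAsFactors = TRUE gives Factor columns like those in the listing
# (this was the default before R 4.0).
aqi <- read.csv(csv_path, stringsAsFactors = TRUE)

# Inspect column types and the first few values of each column.
str(aqi)
```

Reading with `stringsAsFactors = TRUE` is only needed to reproduce the factor columns; character columns work equally well for the later steps.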

Introduction to dataset

The data set consists of 1044 observations with 19 variables for the year 2017.

Complete details about the dataset are provided at https://www.epa.gov/outdoor-air-quality-data/about-air-data-reports#aqi

Each row of the AQI Report lists summary values for one year for one county.

https://www.airnow.gov/index.cfm?action=aqibasics.aqi

The summary values include both qualitative measures (days of the year having “good” air quality, for example) and descriptive statistics (median AQI value, for example).

Univariate analysis

##             State            County         Year      Days.with.AQI  
##  California    : 50   Washington: 13   Min.   :2017   Min.   :  5.0  
##  Texas         : 47   Jefferson : 12   1st Qu.:2017   1st Qu.:178.0  
##  Ohio          : 41   Franklin  :  8   Median :2017   Median :212.0  
##  Indiana       : 40   Jackson   :  8   Mean   :2017   Mean   :202.7  
##  Florida       : 39   Lake      :  8   3rd Qu.:2017   3rd Qu.:270.0  
##  North Carolina: 38   Montgomery:  8   Max.   :2017   Max.   :304.0  
##  (Other)       :789   (Other)   :987                                 
##    Good.Days     Moderate.Days    Unhealthy.for.Sensitive.Groups.Days
##  Min.   :  5.0   Min.   :  0.00   Min.   : 0.000                     
##  1st Qu.:131.0   1st Qu.: 12.00   1st Qu.: 0.000                     
##  Median :168.5   Median : 27.00   Median : 0.000                     
##  Mean   :164.4   Mean   : 35.60   Mean   : 2.247                     
##  3rd Qu.:209.2   3rd Qu.: 50.25   3rd Qu.: 2.000                     
##  Max.   :302.0   Max.   :178.00   Max.   :90.000                     
##                                                                      
##  Unhealthy.Days     Very.Unhealthy.Days Hazardous.Days    
##  Min.   :  0.0000   Min.   : 0.00000    Min.   : 0.00000  
##  1st Qu.:  0.0000   1st Qu.: 0.00000    1st Qu.: 0.00000  
##  Median :  0.0000   Median : 0.00000    Median : 0.00000  
##  Mean   :  0.4138   Mean   : 0.04023    Mean   : 0.02107  
##  3rd Qu.:  0.0000   3rd Qu.: 0.00000    3rd Qu.: 0.00000  
##  Max.   :106.0000   Max.   :15.00000    Max.   :10.00000  
##                                                           
##     Max.AQI       X90th.Percentile.AQI   Median.AQI        Days.CO       
##  Min.   :   1.0   Min.   :  0.00       Min.   :  0.00   Min.   :  0.000  
##  1st Qu.:  77.0   1st Qu.: 48.00       1st Qu.: 33.00   1st Qu.:  0.000  
##  Median :  97.0   Median : 54.00       Median : 39.00   Median :  0.000  
##  Mean   : 105.7   Mean   : 55.08       Mean   : 35.57   Mean   :  0.274  
##  3rd Qu.: 119.0   3rd Qu.: 62.00       3rd Qu.: 42.00   3rd Qu.:  0.000  
##  Max.   :3439.0   Max.   :208.00       Max.   :135.00   Max.   :162.000  
##                                                                          
##     Days.NO2     Days.Ozone       Days.SO2       Days.PM2.5   
##  Min.   :  0   Min.   :  0.0   Min.   :  0.0   Min.   :  0.0  
##  1st Qu.:  0   1st Qu.:  0.0   1st Qu.:  0.0   1st Qu.:  0.0  
##  Median :  0   Median :122.0   Median :  0.0   Median : 44.0  
##  Mean   :  3   Mean   :121.1   Mean   : 10.6   Mean   : 59.7  
##  3rd Qu.:  0   3rd Qu.:192.0   3rd Qu.:  0.0   3rd Qu.: 91.0  
##  Max.   :190   Max.   :304.0   Max.   :304.0   Max.   :302.0  
##                                                               
##    Days.PM10      
##  Min.   :  0.000  
##  1st Qu.:  0.000  
##  Median :  0.000  
##  Mean   :  8.006  
##  3rd Qu.:  0.000  
##  Max.   :284.000  
## 
##  [1] "State"                              
##  [2] "County"                             
##  [3] "Year"                               
##  [4] "Days.with.AQI"                      
##  [5] "Good.Days"                          
##  [6] "Moderate.Days"                      
##  [7] "Unhealthy.for.Sensitive.Groups.Days"
##  [8] "Unhealthy.Days"                     
##  [9] "Very.Unhealthy.Days"                
## [10] "Hazardous.Days"                     
## [11] "Max.AQI"                            
## [12] "X90th.Percentile.AQI"               
## [13] "Median.AQI"                         
## [14] "Days.CO"                            
## [15] "Days.NO2"                           
## [16] "Days.Ozone"                         
## [17] "Days.SO2"                           
## [18] "Days.PM2.5"                         
## [19] "Days.PM10"

The summary covers the states and counties that reported in 2017 across all 19 variables. For most counties the median number of unhealthy, very unhealthy, and hazardous days is 0.

AQI Days plots

Histograms were used to view the counts and ranges of the different AQI day categories. The binwidth and colors were chosen based on the number of days in the year and the AQI values of those days.

Pollutant measured

A daily index value is calculated for each air pollutant measured. The highest of those index values is the AQI value, and the pollutant responsible for the highest index value is the “Main Pollutant.” These columns give the number of days each pollutant measured was the main pollutant. A blank column indicates a pollutant not measured in the county or CBSA.

The graph shows that ozone and PM2.5 were most often the main pollutant. CO was the main pollutant on the fewest days.
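The rule for the daily AQI and the main pollutant can be sketched as follows. The index values are invented for illustration and do not come from the dataset.

```r
# Hypothetical daily index values for one county on one day.
daily_index <- c(CO = 12, NO2 = 25, Ozone = 61, SO2 = 8, PM2.5 = 54, PM10 = 30)

# The day's AQI is the highest of the per-pollutant index values.
aqi_value <- max(daily_index)

# The main pollutant is the one responsible for that highest value.
main_pollutant <- names(daily_index)[which.max(daily_index)]

aqi_value
main_pollutant
```

Counting how often each pollutant is the main pollutant over a year gives columns such as Days.Ozone and Days.PM2.5.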

Simple Univariate challenge

https://leanpub.com/exdata

Which counties in the United States have the highest levels of ozone pollution in year 2017?

The ranking is built from Days.with.AQI and State. Days.with.AQI is the number of days in the year on which an AQI value was reported for the county, so it serves here only as a rough proxy for ranking and does not measure ozone levels directly.

Create a variable called state_ranking that holds the log10 of Days.with.AQI.

## 'data.frame':    1044 obs. of  3 variables:
##  $ State                 : Factor w/ 54 levels "Alabama","Alaska",..: 5 5 5 6 14 17 17 17 17 17 ...
##  $ County                : Factor w/ 809 levels "Abbeville","Ada",..: 274 385 610 395 46 172 424 504 553 580 ...
##  $ air_quality_data_log10: num  2.48 2.48 2.48 2.48 2.48 ...
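A minimal sketch of the ranking step, using base R on four counties. The Days.with.AQI values are back-calculated from the log10 values in the lists (for example 10^2.482874 is 304), and the use of `order()` rather than the author's own code is an assumption.

```r
# Four counties taken from the high and low lists; 304, 10 and 5 are the
# Days.with.AQI values implied by the reported log10 values.
aqi <- data.frame(
  State = c("California", "Montana", "Iowa", "Oregon"),
  County = c("Fresno", "Roosevelt", "Linn", "Clackamas"),
  Days.with.AQI = c(304, 5, 304, 10)
)

# Keep State and County and add the log10-transformed count.
state_ranking <- data.frame(
  State = aqi$State,
  County = aqi$County,
  air_quality_data_log10 = log10(aqi$Days.with.AQI)
)

# Sort from highest to lowest; ties keep their original order.
state_ranking <- state_ranking[order(-state_ranking$air_quality_data_log10), ]

head(state_ranking, 10)  # top of the ranking
tail(state_ranking, 10)  # bottom of the ranking
```

Because log10 is monotonic, the ordering is the same as sorting on Days.with.AQI itself; the transform only compresses the scale.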

Highly polluted States list

##         State    County air_quality_data_log10
## 1  California    Fresno               2.482874
## 2  California     Kings               2.482874
## 3  California Riverside               2.482874
## 4    Colorado  La Plata               2.482874
## 5       Idaho   Bannock               2.482874
## 6        Iowa   Clinton               2.482874
## 7        Iowa      Linn               2.482874
## 8        Iowa Muscatine               2.482874
## 9        Iowa Palo Alto               2.482874
## 10       Iowa      Polk               2.482874

Least polluted States list

##               State    County air_quality_data_log10
## 1035         Oregon Clackamas              1.0000000
## 1036           Utah  Garfield              1.0000000
## 1037     Washington Klickitat              1.0000000
## 1038     California Del Norte              0.9542425
## 1039        Montana    Powell              0.9542425
## 1040       Nebraska    Thomas              0.9542425
## 1041         Oregon   Wallowa              0.9542425
## 1042 South Carolina Abbeville              0.9542425
## 1043          Idaho    Custer              0.9030900
## 1044        Montana Roosevelt              0.6989700

Interestingly, California appears in both the high and the low lists. By this ranking, Del Norte County is among the lowest-ranked counties, while Fresno, Kings, and Riverside are among the highest.

Bivariate analysis

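The dataset does not include the share of good days or of days unhealthy for sensitive groups in 2017. These are derived per county as Good.Days and Unhealthy.for.Sensitive.Groups.Days divided by Days.with.AQI.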

## # A tibble: 20 x 5
## # Groups:   State, County [20]
##    State   County            Days.with.AQI mean_Good_AQI mean_Sensitive_A~
##    <fct>   <fct>                     <int>         <dbl>             <dbl>
##  1 Alabama Baldwin                     135         0.822           0.00741
##  2 Alabama Clay                         58         0.845           0.     
##  3 Alabama Colbert                     140         0.914           0.     
##  4 Alabama DeKalb                      241         0.909           0.     
##  5 Alabama Elmore                      105         0.952           0.     
##  6 Alabama Etowah                      178         0.663           0.00562
##  7 Alabama Houston                     140         0.893           0.00714
##  8 Alabama Jefferson                   243         0.424           0.0247 
##  9 Alabama Lawrence                     11         1.00            0.     
## 10 Alabama Madison                     167         0.856           0.     
## 11 Alabama Mobile                      181         0.768           0.0276 
## 12 Alabama Montgomery                  178         0.708           0.0112 
## 13 Alabama Morgan                      181         0.851           0.     
## 14 Alabama Russell                     133         0.842           0.     
## 15 Alabama Shelby                      263         0.954           0.     
## 16 Alabama Sumter                      170         0.924           0.     
## 17 Alabama Talladega                    60         0.783           0.     
## 18 Alabama Tuscaloosa                  171         0.614           0.00585
## 19 Alaska  "Aleutians East "            11         1.00            0.     
## 20 Alaska  "Anchorage "                181         0.818           0.00552
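The two derived columns can be reproduced with a short base R sketch. The three rows are copied from the Alabama data shown earlier; using plain division instead of a grouped summary is an assumption about how the values were computed, although the results agree with the table.

```r
# Three Alabama counties with values from the structure listing.
aqi <- data.frame(
  State = "Alabama",
  County = c("Baldwin", "Clay", "Jefferson"),
  Days.with.AQI = c(135, 58, 243),
  Good.Days = c(111, 49, 103),
  Unhealthy.for.Sensitive.Groups.Days = c(1, 0, 6)
)

# Share of reported days that were good.
aqi$mean_Good_AQI <- aqi$Good.Days / aqi$Days.with.AQI

# Share of reported days that were unhealthy for sensitive groups.
aqi$mean_Sensitive_AQI <-
  aqi$Unhealthy.for.Sensitive.Groups.Days / aqi$Days.with.AQI

aqi[, c("County", "Days.with.AQI", "mean_Good_AQI", "mean_Sensitive_AQI")]
```

Dividing by Days.with.AQI makes counties comparable even when they reported on very different numbers of days.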

Plots

The values can be plotted against the number of days in the year having an AQI value.

The plot reveals the outliers in both the good-day share and the sensitive-day share.

Multivariate analysis

Sensitive group days vs Good days in each State

A scatterplot helps us view the number of sensitive-group days against the number of good days for each state.

Integrate with States Map

Returning to the least and most polluted states, this section plots the data on a US map. The main challenge is that the EPA data do not include latitude and longitude for the states. The approach used here came from the following sources:

https://ekburchfield.files.wordpress.com/2017/05/web_scraping_in_r.pdf

https://s3.amazonaws.com/udacity-hosted-downloads/ud651/GeographyOfAmericanMusic.html

https://github.com/tidyverse/tidyverse/issues/66

Map data with State abbreviation

##            state.abb state.name state.area         x       y
## Alabama           AL    Alabama      51609  -86.7509 32.5901
## Arizona           AZ    Arizona     113909 -111.6250 34.2192
## Arkansas          AR   Arkansas      53104  -92.2992 34.7336
## California        CA California     158693 -119.7730 36.5341
## Colorado          CO   Colorado     104247 -105.5130 38.6777
##                state.division state.region Population Income Illiteracy
## Alabama    East South Central        South       3615   3624        2.1
## Arizona              Mountain         West       2212   4530        1.8
## Arkansas   West South Central        South       2110   3378        1.9
## California            Pacific         West      21198   5114        1.1
## Colorado             Mountain         West       2541   4884        0.7
##            Life.Exp Murder HS.Grad Frost   Area
## Alabama       69.05   15.1    41.3    20  50708
## Arizona       70.55    7.8    58.1    15 113417
## Arkansas      70.66   10.1    39.9    65  51945
## California    71.71   10.3    62.6    20 156361
## Colorado      72.06    6.8    63.9   166 103766
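The columns in this listing match R's built-in `state` datasets, so a table like it can be assembled as sketched below. This is an assumption about how the table was built; the listing above also appears to omit Alaska, which this sketch keeps.

```r
# All objects below ship with base R (package "datasets").
# state.center is a list with components x (longitude) and y (latitude);
# state.x77 is a matrix whose row names are the state names.
states <- data.frame(
  state.abb, state.name, state.area, state.center,
  state.division, state.region, state.x77
)

# Row names come from state.x77; "Life Exp" and "HS Grad" become
# Life.Exp and HS.Grad because data.frame() makes names syntactic.
head(states, 5)
```

The `x` and `y` columns give an approximate center for each state, which is enough to place a label or a dot on a map.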

Map data with county details

##        long      lat group order  region subregion   State  County
## 1 -86.50517 32.34920     1     1 alabama   autauga Alabama Autauga
## 2 -86.53382 32.35493     1     2 alabama   autauga Alabama Autauga
## 3 -86.54527 32.36639     1     3 alabama   autauga Alabama Autauga
## 4 -86.55673 32.37785     1     4 alabama   autauga Alabama Autauga
## 5 -86.57966 32.38357     1     5 alabama   autauga Alabama Autauga
## 6 -86.59111 32.37785     1     6 alabama   autauga Alabama Autauga

Merge mean_Good_AQI data with map data

##       State  County Days.with.AQI mean_Good_AQI mean_Sensitive_AQI
## 102 Alabama Baldwin           135     0.8222222        0.007407407
## 109 Alabama Baldwin           135     0.8222222        0.007407407
## 40  Alabama Baldwin           135     0.8222222        0.007407407
## 95  Alabama Baldwin           135     0.8222222        0.007407407
## 81  Alabama Baldwin           135     0.8222222        0.007407407
## 53  Alabama Baldwin           135     0.8222222        0.007407407
## 66  Alabama Baldwin           135     0.8222222        0.007407407
## 104 Alabama Baldwin           135     0.8222222        0.007407407
## 54  Alabama Baldwin           135     0.8222222        0.007407407
## 79  Alabama Baldwin           135     0.8222222        0.007407407
## 64  Alabama Baldwin           135     0.8222222        0.007407407
## 78  Alabama Baldwin           135     0.8222222        0.007407407
## 89  Alabama Baldwin           135     0.8222222        0.007407407
## 65  Alabama Baldwin           135     0.8222222        0.007407407
## 67  Alabama Baldwin           135     0.8222222        0.007407407
## 92  Alabama Baldwin           135     0.8222222        0.007407407
## 80  Alabama Baldwin           135     0.8222222        0.007407407
## 98  Alabama Baldwin           135     0.8222222        0.007407407
## 96  Alabama Baldwin           135     0.8222222        0.007407407
## 103 Alabama Baldwin           135     0.8222222        0.007407407
##          long      lat group order  region subregion
## 102 -87.93757 31.14599     2    53 alabama   baldwin
## 109 -87.93183 31.15171     2    54 alabama   baldwin
## 40  -87.92037 31.15171     2    55 alabama   baldwin
## 95  -87.90318 31.16318     2    56 alabama   baldwin
## 81  -87.89173 31.16318     2    57 alabama   baldwin
## 53  -87.88026 31.15171     2    58 alabama   baldwin
## 66  -87.87453 31.16318     2    59 alabama   baldwin
## 104 -87.87453 31.18036     2    60 alabama   baldwin
## 54  -87.86308 31.18036     2    61 alabama   baldwin
## 79  -87.85735 31.18609     2    62 alabama   baldwin
## 64  -87.86308 31.19182     2    63 alabama   baldwin
## 78  -87.87453 31.19755     2    64 alabama   baldwin
## 89  -87.86880 31.20901     2    65 alabama   baldwin
## 65  -87.84016 31.20901     2    66 alabama   baldwin
## 67  -87.83443 31.22047     2    67 alabama   baldwin
## 92  -87.85162 31.24339     2    68 alabama   baldwin
## 80  -87.83443 31.24339     2    69 alabama   baldwin
## 98  -87.82870 31.24912     2    70 alabama   baldwin
## 96  -87.82870 31.26058     2    71 alabama   baldwin
## 103 -87.81724 31.27777     2    72 alabama   baldwin
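A small sketch of the merge step with base R `merge()`. The two data frames are cut-down stand-ins built from rows shown in this report, and the join keys State and County are an assumption based on the column names in the output.

```r
# County-level shares (stand-in for the full derived table).
aqi_share <- data.frame(
  State = "Alabama",
  County = c("Baldwin", "Clay"),
  mean_Good_AQI = c(111 / 135, 49 / 58)
)

# Boundary points (stand-in for the full county map data).
county_map <- data.frame(
  long = c(-87.93757, -87.93183, -87.92037, -86.50517),
  lat = c(31.14599, 31.15171, 31.15171, 32.34920),
  group = c(2, 2, 2, 1),
  order = c(53, 54, 55, 1),
  State = "Alabama",
  County = c("Baldwin", "Baldwin", "Baldwin", "Autauga")
)

# Inner join: every boundary point of a county receives that county's share.
merged <- merge(aqi_share, county_map, by = c("State", "County"))

# merge() does not preserve drawing order, so restore it before plotting.
merged <- merged[order(merged$order), ]
merged
```

Counties without AQI data (Autauga here) and counties without map points (Clay here) drop out of an inner join, which is why the merged table repeats one county's values across many boundary rows.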

Plot the data

The map shows county boundaries with state names. Darker regions are the more polluted ones, meaning a lower percentage of good days in the year. The colour scale is shown at the bottom.

Correlation between Good days AQI and Sensitive Days AQI

Complete by finding the correlation between good AQI days and the sensitive AQI days.

## [1] "-0.66"

The correlation coefficient is r = -0.66.
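A hedged sketch of how such a coefficient can be computed and printed as a string. It uses only the first ten Alabama rows from the structure listing, so its result will differ from the full-data value of -0.66; Pearson correlation (the `cor()` default) is an assumption about the method used.

```r
# First ten counties from the structure listing.
days <- c(135, 58, 140, 241, 105, 178, 140, 243, 11, 167)
good <- c(111, 49, 128, 219, 100, 118, 125, 103, 11, 143)
sens <- c(1, 0, 0, 0, 0, 1, 1, 6, 0, 0)

# Shares of reported days, as in the derived columns.
good_share <- good / days
sens_share <- sens / days

# Pearson correlation between the two shares.
r <- cor(good_share, sens_share)

# Round to two decimals and format as text, like the "-0.66" output.
r_text <- format(round(r, 2), nsmall = 2)
r_text
```

A negative value means that counties with a larger share of good days tend to have a smaller share of days unhealthy for sensitive groups.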

Final Plots and Summary

Good days are the days in the year with an AQI value of 0 through 50. Comparing the good-day counts across counties helps identify the healthier states in the USA. Florida and the District of Columbia were chosen to plot the good-day counts for each county.

http://paulorenato.com/index.php/171

The binwidth and the scales option in facet_wrap helped give a clear view of the good AQI days in each county.

In the graph above, the Florida panel reaches a maximum y value of 4, which means that 4 counties had 131 good AQI days: Miami-Dade, Palm Beach, Pinellas, and Sarasota. The same applies to 149 good AQI days.

The District of Columbia has only one county, with 157 good AQI days in 2017.

Map plotting of Sensitive AQI days

For a bird’s-eye view of pollution in each state, we return to the map plot. To get a better picture, the map uses the days that were unhealthy for sensitive groups, with AQI levels of 101 to 150.

Explore further based on state area

##   state.abb state.name state.area
## 1        AZ    Arizona     113909
## 2        CA California     158693
## 3        CO   Colorado     104247
## 4        MT    Montana     147138
## 5        NV     Nevada     110540
## 6        NM New Mexico     121666

The green dots mark the largest states in the US by area.

Reflection

At the start of the project the goal was to plot the pollution levels of the various states in the USA. After exploring the variables available in the dataset, the direction changed to finding the least polluted and most polluted states.

The simple univariate challenge prompted a lot of thought about how to handle the data.

The next interesting finding is the correlation between sensitive AQI days and good AQI days: r = -0.66 indicates a fairly strong downhill (negative) linear relationship between the two.

After exploring the number of good AQI days (values 0 through 50) and the number of days unhealthy for sensitive groups (values 101 through 150), it was interesting to plot the data on a US map to get a complete view of pollution in each county.

There is plenty of room to explore the same dataset further: the relationship between ozone and PM2.5, which pollutant dominates in each state, and how each state's population affects pollution and vice versa.